#COBS: A Background

First, let’s confirm that we have the required tools for asking questions of the NCBI database. To do this we will use the NCBI API: Entrez Direct and install it via the UNIX command line.

pwd
cd ~
perl -MNet::FTP -e \
  '$ftp = new Net::FTP("ftp.ncbi.nlm.nih.gov", Passive => 1);
   $ftp->login; $ftp->binary;
   $ftp->get("/entrez/entrezdirect/edirect.zip");'
unzip -u -q edirect.zip
rm edirect.zip
export PATH=$PATH:$HOME/edirect
./edirect/setup.sh

Let’s take a look to confirm that it was installed in the home directory:

cd ~
ls

Now that we have the tools, what will we use them for? COBS is a unique dataset that includes many metagenomes from soil aggregates. Many functions in ecosystems are microbially mediated and catalyzed by enzymes. In this study, funded by the DOE, we are interested in carnon cycleing and therefore the enyzmes associated with that funciton. Our collaborator (Kirsten Hofmockel) has identified a few enzymes of interest for our system. The list of enzymes is contained in a text file: ec_numbers.txt Let’s navigate to that directory and take a look:

cd ~/Documents/COBS_CAZY
cat ec_numbers.txt

We now have a list of enzymes of interest, let’s use those NCBI tools to find bacterial sequences associated with those enzymes by creating a python script:

import sys
from Bio import Entrez, SeqIO

Entrez.email = 'jflater@iastate.edu'

# First, find entries that contain the E.C. number
ec_num = sys.argv[1].strip()
#print ec_num
#print 'E.C. '+ ec_num
esearch_handle = Entrez.esearch(db='nucleotide', term='E.C. '+ec_num)
entries = Entrez.read(esearch_handle)
esearch_handle.close()

# Second, fetch these entries
efetch_handle = Entrez.efetch(db='nucleotide', id=entries['IdList'], rettype='gb', retmode='xml') 
records = Entrez.parse(efetch_handle)

# Now, we go through the records and look for a feature with name 'EC_number'
missing = []
for record in records:
  try:
      for feature in record['GBSeq_feature-table']:
          for subfeature in feature['GBFeature_quals']:
              if (subfeature['GBQualifier_name'] == 'EC_number'   and
                subfeature['GBQualifier_value'] == ec_num):

                    # If we found it, we extract the seq's start and end
                    accession = record['GBSeq_primary-accession']
                    interval = feature['GBFeature_intervals'][0]
                    interval_start = interval['GBInterval_from']
                    interval_end = interval['GBInterval_to']
                    location = feature['GBFeature_location']
                    if location.startswith('complement'):
                        strand = 2
                    else:
                        strand = 1

                    # Now we fetch the nucleotide sequence
                    handle = Entrez.efetch(db="nucleotide", id=accession,
                                           rettype="fasta", strand=strand,
                                           seq_start = interval_start,
                                           seq_stop = interval_end)
                    seq = SeqIO.read(handle, "fasta")

                    print('>GenBank Accession:{}'.format(accession))
                    print(seq.seq)
  except ValueError: ## if whatever you are trying to do did not go through
        missing.append(records.id) ## write the stuff to list "missing"
print missing
#fp.write('\n'.join(missing))
efetch_handle.close()

Now let’s apply that script to our list of EC #’s by using the command line:

while read line;     
  do python scripts/nucl_from_ec.py $line > "$line".txt;    
  done < ec_numbers.txt

This command line script will go through our list of EC numbers and retrieve the associated nucleotide sequences and output a file for each EC # with those sequences in fasta format.

LS0tCnRpdGxlOiAiQ09CU19DQVpZIgpvdXRwdXQ6IGh0bWxfbm90ZWJvb2sKcm9vdDogLi4vLi4vLi4vCi0tLQohW10oaW1hZ2VzL0NPQlNQSUMucG5nKQotLS0tLS0KI0NPQlM6IEEgQmFja2dyb3VuZAotLS0tLS0KRmlyc3QsIGxldCdzIGNvbmZpcm0gdGhhdCB3ZSBoYXZlIHRoZSByZXF1aXJlZCB0b29scyBmb3IgYXNraW5nIHF1ZXN0aW9ucyBvZiB0aGUgTkNCSSBkYXRhYmFzZS4gVG8gZG8gdGhpcyB3ZSB3aWxsIHVzZSB0aGUgTkNCSSBBUEk6IEVudHJleiBEaXJlY3QgYW5kIGluc3RhbGwgaXQgdmlhIHRoZSBVTklYIGNvbW1hbmQgbGluZS4gCmBgYHtiYXNoLCBldmFsPUZBTFNFfQpwd2QKY2QgfgpwZXJsIC1NTmV0OjpGVFAgLWUgXAogICckZnRwID0gbmV3IE5ldDo6RlRQKCJmdHAubmNiaS5ubG0ubmloLmdvdiIsIFBhc3NpdmUgPT4gMSk7CiAgICRmdHAtPmxvZ2luOyAkZnRwLT5iaW5hcnk7CiAgICRmdHAtPmdldCgiL2VudHJlei9lbnRyZXpkaXJlY3QvZWRpcmVjdC56aXAiKTsnCnVuemlwIC11IC1xIGVkaXJlY3QuemlwCnJtIGVkaXJlY3QuemlwCmV4cG9ydCBQQVRIPSRQQVRIOiRIT01FL2VkaXJlY3QKLi9lZGlyZWN0L3NldHVwLnNoCmBgYApMZXQncyB0YWtlIGEgbG9vayB0byBjb25maXJtIHRoYXQgaXQgd2FzIGluc3RhbGxlZCBpbiB0aGUgaG9tZSBkaXJlY3Rvcnk6CmBgYHtiYXNofQpjZCB+CmxzCmBgYApOb3cgdGhhdCB3ZSBoYXZlIHRoZSB0b29scywgd2hhdCB3aWxsIHdlIHVzZSB0aGVtIGZvcj8gQ09CUyBpcyBhIHVuaXF1ZSBkYXRhc2V0IHRoYXQgaW5jbHVkZXMgbWFueSBtZXRhZ2Vub21lcyBmcm9tIHNvaWwgYWdncmVnYXRlcy4gTWFueSBmdW5jdGlvbnMgaW4gZWNvc3lzdGVtcyBhcmUgbWljcm9iaWFsbHkgbWVkaWF0ZWQgYW5kIGNhdGFseXplZCBieSBlbnp5bWVzLiBJbiB0aGlzIHN0dWR5LCBmdW5kZWQgYnkgdGhlIERPRSwgd2UgYXJlIGludGVyZXN0ZWQgaW4gY2Fybm9uIGN5Y2xlaW5nIGFuZCB0aGVyZWZvcmUgdGhlIGVueXptZXMgYXNzb2NpYXRlZCB3aXRoIHRoYXQgZnVuY2l0b24uIE91ciBjb2xsYWJvcmF0b3IgKEtpcnN0ZW4gSG9mbW9ja2VsKSBoYXMgaWRlbnRpZmllZCBhIGZldyBlbnp5bWVzIG9mIGludGVyZXN0IGZvciBvdXIgc3lzdGVtLiBUaGUgbGlzdCBvZiBlbnp5bWVzIGlzIGNvbnRhaW5lZCBpbiBhIHRleHQgZmlsZTogZWNfbnVtYmVycy50eHQKTGV0J3MgbmF2aWdhdGUgdG8gdGhhdCBkaXJlY3RvcnkgYW5kIHRha2UgYSBsb29rOgpgYGB7YmFzaH0KY2Qgfi9Eb2N1bWVudHMvQ09CU19DQVpZCmNhdCBlY19udW1iZXJzLnR4dApgYGAKV2Ugbm93IGhhdmUgYSBsaXN0IG9mIGVuenltZXMgb2YgaW50ZXJlc3QsIGxldCdzIHVzZSB0aG9zZSBOQ0JJIHRvb2xzIHRvIGZpbmQgYmFjdGVyaWFsIHNlcXVlbmNlcyBhc3NvY2lhdGVkIHdpdGggdGhvc2UgZW56eW1lcyBieSBjcmVhdGluZyBhIHB5dGhvbiBzY3JpcHQ6ICAKYGBge3B5dGhvbiwgZXZhbD1GQUxTRX0KaW1wb3J0IHN5cwpmcm9tIEJpbyBpbXBvcnQgRW50cmV6LCBTZXFJTwoKRW50cmV6LmVtYWlsID0gJ2pmbGF0ZXJAaWFzdGF0ZS5lZHUnCgojIEZpcnN0LCBmaW5kIGVudHJpZXMgdGhhdCBjb250YWluIHRoZSBFLkMuIG51bWJlcgplY19udW0gPSBzeXMuYXJndlsxXS5zdHJpcCgpCiNwcmludCBlY19udW0KI3ByaW50ICdFLkMuICcrIGVjX251bQplc2VhcmNoX2hhbmRsZSA9IEVudHJlei5lc2VhcmNoKGRiPSdudWNsZW90aWRlJywgdGVybT0nRS5DLiAnK2VjX251bSkKZW50cmllcyA9IEVudHJlei5yZWFkKGVzZWFyY2hfaGFuZGxlKQplc2VhcmNoX2hhbmRsZS5jbG9zZSgpCgojIFNlY29uZCwgZmV0Y2ggdGhlc2UgZW50cmllcwplZmV0Y2hfaGFuZGxlID0gRW50cmV6LmVmZXRjaChkYj0nbnVjbGVvdGlkZScsIGlkPWVudHJpZXNbJ0lkTGlzdCddLCByZXR0eXBlPSdnYicsIHJldG1vZGU9J3htbCcpIApyZWNvcmRzID0gRW50cmV6LnBhcnNlKGVmZXRjaF9oYW5kbGUpCgojIE5vdywgd2UgZ28gdGhyb3VnaCB0aGUgcmVjb3JkcyBhbmQgbG9vayBmb3IgYSBmZWF0dXJlIHdpdGggbmFtZSAnRUNfbnVtYmVyJwptaXNzaW5nID0gW10KZm9yIHJlY29yZCBpbiByZWNvcmRzOgogIHRyeToKICAgICAgZm9yIGZlYXR1cmUgaW4gcmVjb3JkWydHQlNlcV9mZWF0dXJlLXRhYmxlJ106CiAgICAgICAgICBmb3Igc3ViZmVhdHVyZSBpbiBmZWF0dXJlWydHQkZlYXR1cmVfcXVhbHMnXToKICAgICAgICAgICAgICBpZiAoc3ViZmVhdHVyZVsnR0JRdWFsaWZpZXJfbmFtZSddID09ICdFQ19udW1iZXInICAgYW5kCiAgICAgICAgICAgICAgICBzdWJmZWF0dXJlWydHQlF1YWxpZmllcl92YWx1ZSddID09IGVjX251bSk6CgogICAgICAgICAgICAgICAgICAgICMgSWYgd2UgZm91bmQgaXQsIHdlIGV4dHJhY3QgdGhlIHNlcSdzIHN0YXJ0IGFuZCBlbmQKICAgICAgICAgICAgICAgICAgICBhY2Nlc3Npb24gPSByZWNvcmRbJ0dCU2VxX3ByaW1hcnktYWNjZXNzaW9uJ10KICAgICAgICAgICAgICAgICAgICBpbnRlcnZhbCA9IGZlYXR1cmVbJ0dCRmVhdHVyZV9pbnRlcnZhbHMnXVswXQogICAgICAgICAgICAgICAgICAgIGludGVydmFsX3N0YXJ0ID0gaW50ZXJ2YWxbJ0dCSW50ZXJ2YWxfZnJvbSddCiAgICAgICAgICAgICAgICAgICAgaW50ZXJ2YWxfZW5kID0gaW50ZXJ2YWxbJ0dCSW50ZXJ2YWxfdG8nXQogICAgICAgICAgICAgICAgICAgIGxvY2F0aW9uID0gZmVhdHVyZVsnR0JGZWF0dXJlX2xvY2F0aW9uJ10KICAgICAgICAgICAgICAgICAgICBpZiBsb2NhdGlvbi5zdGFydHN3aXRoKCdjb21wbGVtZW50Jyk6CiAgICAgICAgICAgICAgICAgICAgICAgIHN0cmFuZCA9IDIKICAgICAgICAgICAgICAgICAgICBlbHNlOgogICAgICAgICAgICAgICAgICAgICAgICBzdHJhbmQgPSAxCgogICAgICAgICAgICAgICAgICAgICMgTm93IHdlIGZldGNoIHRoZSBudWNsZW90aWRlIHNlcXVlbmNlCiAgICAgICAgICAgICAgICAgICAgaGFuZGxlID0gRW50cmV6LmVmZXRjaChkYj0ibnVjbGVvdGlkZSIsIGlkPWFjY2Vzc2lvbiwKICAgICAgICAgICAgICAgICAgICAgICAgICAgICAgICAgICAgICAgICAgIHJldHR5cGU9ImZhc3RhIiwgc3RyYW5kPXN0cmFuZCwKICAgICAgICAgICAgICAgICAgICAgICAgICAgICAgICAgICAgICAgICAgIHNlcV9zdGFydCA9IGludGVydmFsX3N0YXJ0LAogICAgICAgICAgICAgICAgICAgICAgICAgICAgICAgICAgICAgICAgICAgc2VxX3N0b3AgPSBpbnRlcnZhbF9lbmQpCiAgICAgICAgICAgICAgICAgICAgc2VxID0gU2VxSU8ucmVhZChoYW5kbGUsICJmYXN0YSIpCgogICAgICAgICAgICAgICAgICAgIHByaW50KCc+R2VuQmFuayBBY2Nlc3Npb246e30nLmZvcm1hdChhY2Nlc3Npb24pKQogICAgICAgICAgICAgICAgICAgIHByaW50KHNlcS5zZXEpCiAgZXhjZXB0IFZhbHVlRXJyb3I6ICMjIGlmIHdoYXRldmVyIHlvdSBhcmUgdHJ5aW5nIHRvIGRvIGRpZCBub3QgZ28gdGhyb3VnaAogICAgICAgIG1pc3NpbmcuYXBwZW5kKHJlY29yZHMuaWQpICMjIHdyaXRlIHRoZSBzdHVmZiB0byBsaXN0ICJtaXNzaW5nIgpwcmludCBtaXNzaW5nCiNmcC53cml0ZSgnXG4nLmpvaW4obWlzc2luZykpCmVmZXRjaF9oYW5kbGUuY2xvc2UoKQpgYGAKTm93IGxldCdzIGFwcGx5IHRoYXQgc2NyaXB0IHRvIG91ciBsaXN0IG9mIEVDICMncyBieSB1c2luZyB0aGUgY29tbWFuZCBsaW5lOgpgYGB7YmFzaCwgZXZhbD1GQUxTRX0Kd2hpbGUgcmVhZCBsaW5lOyAgICAgCiAgZG8gcHl0aG9uIHNjcmlwdHMvbnVjbF9mcm9tX2VjLnB5ICRsaW5lID4gIiRsaW5lIi50eHQ7ICAgIAogIGRvbmUgPCBlY19udW1iZXJzLnR4dApgYGAKVGhpcyBjb21tYW5kIGxpbmUgc2NyaXB0IHdpbGwgZ28gdGhyb3VnaCBvdXIgbGlzdCBvZiBFQyBudW1iZXJzIGFuZCByZXRyaWV2ZSB0aGUgYXNzb2NpYXRlZCBudWNsZW90aWRlIHNlcXVlbmNlcyBhbmQgb3V0cHV0IGEgZmlsZSBmb3IgZWFjaCBFQyAjIHdpdGggdGhvc2Ugc2VxdWVuY2VzIGluIGZhc3RhIGZvcm1hdC4gCg==